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1  SUMMARY 


When  checking  the  timing  correctness  of  a  system  of  real-time  tasks,  upper  bounds  on  the 
execution  times  of  individual  tasks  are  required.  On  a  multicore  platform,  execution-time  bounds 
that  are  not  overly  pessimistic  are  difficult  to  determine.  A  key  stumbling  block  in  this  regard  is 
the  difficulty  of  predicting  when  memory  references  will  hit  in  on-chip  caches.  While  this  is  a 
problem  even  on  uniprocessors,  the  presence  of  shared  caches  on  multicore  platforms  further 
exacerbates  this  problem.  Such  caches,  which  are  widely  used  in  current  commercially  available 
multicore  machines,  can  be  accessed  concurrently  by  tasks  on  different  cores.  This  creates  cross¬ 
core  cache  interactions  that  are  difficult  (if  not  impossible)  to  predict.  In  fact,  such  difficulties  are 
one  of  the  main  reasons  why  multicore  platforms  are  not  in  widespread  use  in  safety-critical 
domains  such  as  avionics  systems.  In  this  project,  a  new  shared  cache  management  framework  has 
been  developed  that  enables  more  predictable  shared  cache  behavior.  In  this  report,  an  overview  of 
this  framework  is  presented  and  the  results  of  an  evaluation  of  it  are  discussed. 

2  INTRODUCTION 

Multicore  platforms  offer  the  potential  of  enabling  computationally  intensive  workloads  in  many 
settings,  with  less  size,  weight,  and  power  (SWaP)  consumption.  Such  settings  range  from  hand¬ 
held  and  embedded  devices,  to  laptop  and  desktop  systems,  to  the  world’s  fastest  supercomputers. 
In  all  of  these  settings,  the  computational  capabilities  of  multicore  chips  are  being  leveraged  to 
realize  a  wealth  of  new  products  and  services  across  many  application  domains.  One  domain, 
however,  stands  out  as  being  largely  unaffected:  safety-critical  cyber-physical  embedded  systems. 

Examples  of  such  systems  include  avionics  and  automotive  systems,  medical  systems  for  diagnosis 
and  treatment,  and  smart  robotic  systems  that  function  in  environments  where  failures  cannot  be 
tolerated  or  are  difficult  to  correct  (e.g. ,  planetary  rovers).  A  common  characteristic  of  these  and 
other  safety-critical  systems  is  that  failures  may  have  catastrophic  consequences,  such  as  loss  of 
life  or  serious  financial  repercussions.  Because  of  the  high  cost  of  failure,  safety-critical  systems 
must  be  certified  (often  by  governmental  or  international  bodies)  before  being  deployed. 

Certification  can  be  expensive  and  time-consuming.  Thus,  it  is  imperative  that  safety-critical 
systems  be  built  using  certification-friendly  hardware  platforms  and  design  processes.  One  of  the 
most  important  tenets  in  this  regard  is  that  computations  should  be  predictable.  Predictability 
ensures  that  behaviors  arising  during  certification  reflect  those  that  will  be  seen  in  the  deployed 
system.  Predictability  is  also  fundamental  when  establishing  real-time  correctness. 

The  importance  of  predictability  in  certification  explains  why  multicore  platforms  are  not  in 
widespread  use  in  safety-critical  domains.  In  such  platforms,  different  cores  share  hardware 
components  such  as  caches  and  memory  controllers.  Using  current  technology,  very  pessimistic 
assumptions  must  be  made  regarding  the  utilization  of  these  shared  resources  during  certification. 
The  processing  capacity  lost  to  such  pessimism  can  easily  negate  the  impact  of  any  additional 
cores.  The  resulting  state  of  affairs  is  unsettling:  the  multicore  revolution  is  enabling  dramatically 
better  functionality  and  services  in  many  domains,  but  safety-critical  cyber-physical  systems  are 
excluded.  Unless  the  “predictability  problem”  associated  with  multicore  platforms  are  addressed, 
functional  advances  in  such  systems  will  continue  to  be  impeded. 
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Focus  of  this  project.  In  this  project,  we  have  conducted  research  on  this  “predictability  problem” 
with  a  particular  emphasis  on  the  subproblem  of  ensuring  predictable  shared  cache  behavior  on 
multicore  platforms.  Ensuring  such  behavior  is  not  only  important  from  the  standpoint  of  real-time 
predictability,  it  is  a  key  first  step  towards  ensuring  safe  and  secure  isolation  of  disparate  system 
components  in  safety-critical  systems.  Much  of  our  research  agenda  has  been  shaped  by  a 
particularly  compelling  driving  problem:  supporting  real-time  functionality  in  future  unmanned  air 
vehicles  (UAVs).  Such  UAVs  will  have  greater  autonomous  capabilities  than  current  designs. 
Some  of  our  work  on  this  driving  problem  has  been  done  in  collaboration  with  colleagues  at 
Northrop  Grumman  Corp.  (NGC)  [29,  46],  In  avionics  systems  such  as  UAVs,  different  software 
components  may  have  different  criticalities.  The  most  important  outcome  of  this  project  is  a  new 
real-time  shared  cache  management  framework  that  enables  more  predictable  cache  behavior  while 
taking  criticality  information  into  account.  Additionally,  various  other  problems  related  to 
multicore  real-time  resource  allocation  were  addressed. 

Organization.  In  the  rest  of  this  report,  we  elaborate  on  the  UAV  driving  problem  and  our 
contributions.  We  begin  by  discussing  our  methods,  assumptions,  and  procedures  in  Section  3.  We 
then  present  an  overview  of  our  main  results  in  Section  4.  We  conclude  in  Section  5. 

3  METHODS,  ASSUMPTIONS,  AND  PROCEDURES 

In  this  section,  we  describe  our  methods,  assumptions,  and  procedures.  We  begin  by  providing 
necessary  background  on  the  UAV  driving  problem  and  on  real-time  resource  allocation  issues. 

3.1  Driving  Problem:  Future  UAVs 

Complex  adaptive  systems  ( CASs )  have  been  identified  by  Air  Force  scientists  as  key  components 
of  the  Air  Force  of  the  future  [2,  17].  A  typical  CAS  can  provide  a  broader  array  of  capabilities 
than  a  traditional  counterpart  while  also  exhibiting  far  more  autonomy,  thus  providing  a  force- 
multiplying  effect.  Oftentimes,  this  additional  autonomy  not  only  replaces  human  pilots  or 
operators,  but  actually  provides  superior  capabilities.  Examples  of  CASs  include  next-generation 
UAVs,  advanced  radar  systems,  and  fractionated  satellites. 

Key  attributes  of  CASs  include:  a  large  number  of  on-  and  off-board  inputs  (e.g.,  embedded 
sensors);  heavy  reliance  on  the  fusion  of  data  from  multiple  sources;  feedback-driven  decision¬ 
making  and  learning;  and  various  mechanisms  to  enable  adaptation  [17].  Such  systems  can  be  said 
to  have  a  “near-infinite”  number  of  possible  states  and  to  be  “fundamentally  nondeterministic.” 
They  exhibit  a  high  degree  of  self-awareness  and  incorporate  high-level  reasoning  in  order  to 
maximize  their  capabilities. 

Next-generation  UAVs:  The  UCAS-D  vision.  A  canonical  “archetype”  of  a  future  Air  Force 
CAS,  exemplifying  all  the  key  attributes  discussed  above,  can  be  found  in  the  U.S.  Navy’s  Un¬ 
manned  Combat  Air  System  ( N-UCAS)  program  (which  actually  originated  as  a  joint  Air  Force- 
Navy  program,  J-UCAS).  In  the  long-term  vision  espoused  under  this  program,  a  system  of 
stealthy  UAVs  with  miscellaneous  capabilities  (including  strike,  surveillance,  reconnaissance,  and 
radar  jamming)  could  be  deployed  over  enemy  airspace  for  extended  periods  of  time.  These 
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vehicles  are  envisioned  to  employ  sophisticated  software  that  enables  them  to  exhibit  on-board 
intelligence  that  allows  them  to  break  down  and  pursue  high-level  objectives  identified  by  a  small 
team  of  mission  operators.  Further,  autonomously  formed  ad  hoc  teams  of  vehicles  exhibiting 
fractionated  capabilities  ( e.g .,  one  vehicle  may  contain  extra  munitions,  while  another  contains 
special  jamming  equipment)  are  envisioned  to  pursue  complex,  theater-level  objectives.  The 
software  to  be  run  on  these  vehicles  needs  to  incorporate  complicated  high-level  reasoning  that 
takes  into  account  theater-level  information  and  that  provides  for  coordination  of  the  overall 
system  of  vehicles.  It  also  needs  to  be  able  to  autonomously  and  quickly  react  to  incoming  hostile 
forces. 

The  need  for  real-time  multicore  support.  Practically  all  CASs  will  include  sophisticated  tim¬ 
ing  constraints.  Thus,  their  design,  development,  verification  and  validation,  and  certification  must 
not  only  account  for  logical  correctness,  but  also  temporal  correctness.  While  methods  for 
ensuring  logical  correctness  are  clearly  deserving  of  significant  attention,  our  research  has  focused 
purely  on  ensuring  temporal  correctness.  Many  CASs,  including  next-generation  UAVs,  will  also 
have  both  significant  computational  workloads  and  stringent  SWaP  requirements.  These  factors 
suggest  utilizing  a  multicore-based  implementation.  However,  ensuring  the  temporal  correctness 
of  software  typically  requires  sacrificing  a  significant  fraction  of  the  underlying  hardware 
platform’s  capacity;  on  a  multicore  platform,  this  loss  can  greatly  exceed  50%  using  current 
resource  allocation  and  analysis  techniques.  To  describe  the  nature  of  such  capacity  loss,  we  must 
first  more  carefully  consider  how  real-time  workloads  are  scheduled  and  provisioned. 

3.2  Multiprocessor  Real-Time  Scheduling 

When  designing  a  real-time  application,  the  goal  is  to  produce  a  schedulable  system,  i.e.,  one 
whose  timing  constraints  can  be  guaranteed.  Timing  constraints  are  usually  expressed  using  dead¬ 
lines.  To  illustrate,  consider  the  well-studied  periodic  task  model  [42],  In  this  model,  the  specified 
system  is  comprised  of  a  collection  of  recurring  sequential  tasks,  Th  ... ,  Tn,  which  are  to  be  scheduled  on 
m  processors.  Each  such  task  7)  releases  a  succession  of  jobs  Ji  X,  Ji  2,  .  .  .and  is  defined  by 
specifying  a  period  p,  and  a  worst-case  execution  time  (WCET)  e,-.  Successive  jobs  of  7)  are 
released  p,-  time  units  apart,  and  one  that  is  released  a  time  t  has  a  deadline  at  time  t  +  p,-.  Also, 
each  such  job  executes  for  at  most  e,-  time  units  and  its  required  processing  time  should  be 
allocated  between  its  release  and  deadline. 

A  periodic  task  system  may  be  scheduled  by  either  an  event-  or  timer-driven  scheduler.  An  event- 
driven  scheduler  uses  priorities  (e.g.,  job  deadlines)  to  make  online  scheduling  decisions.  A  timer- 
driven  scheduler  uses  a  pre-computed  dispatching  table.  The  discussion  below  mostly  focuses  on 
event-driven  schedulers. 

Real-time  schedulability.  Under  the  periodic  model,  the  definition  of  the  term  “schedulable” 
depends  on  whether  deadlines  are  hard  or  soft  (we  consider  both  possibilities).  In  a  hard  real-time 
(777?  7)  system,  deadlines  can  never  be  missed,  while  in  a  soft  real-time  ( SRT)  system,  deadline 
misses  are  sometimes  tolerable.  SRT  schedulability  can  be  defined  in  different  ways;  we  assume 
here  that  a  SRT  system  is  schedulable  if  deadline  tardiness  is  bounded  (such  bounds  would  be 
expected  to  be  reasonably  small).  In  determining  schedulability  (hard  or  soft),  the  processor  share 
required  by  each  task  7)  is  of  importance.  The  share  required  by  7)  is  given  by  the  quantity  ut  = 
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ej/pi,  which  is  called  its  utilization  or  weight. 

Algorithms  for  testing  schedulability  are  necessarily  dependent  on  the  scheduling  algorithm  being 
used.  Since  our  research  targets  multicore  platforms,  multiprocessor  scheduling  algorithms  are  our 
primary  focus.  Such  an  algorithm  may  follow  a  partitioned  approach  (tasks  are  statically  assigned 
to  processors)  or  a  global  scheduling  approach  (any  task  may  execute  on  any  processor).  An 
example  of  a  partitioned  algorithm  is  partitioned  EDF  (P-EDF),  which  uses  the  uniprocessor 
earliest-deadline-first  (EDF)  algorithm  as  the  per-processor  scheduler.  (Finder  EDF,  jobs  with 
earlier  deadlines  have  higher  priority.)  An  example  of  a  global  algorithm  is  global  EDF  (G-EDF), 
under  which  tasks  are  EDF-scheduled  using  a  single  priority  queue.  In  addition  to  “pure” 
partitioned  and  global  algorithms,  hybrid  approaches  are  possible.  One  such  approach  is  clustered 
scheduling,  under  which  tasks  are  partitioned  onto  clusters  of  processors,  with  those  in  each  cluster 
being  globally  scheduled  within  that  cluster  [5,  15]. 

Capacity  loss  in  multicore  systems.  When  assessing  schedulability  on  a  multiprocessor,  it  is 
often  necessary  to  constrain  per-task  utilizations  or  overall  utilization  in  some  way.  Such 
constraints  result  in  capacity  loss,  i.e.,  unavailable  processing  capacity.  For  example,  under 
partitioned  scheduling,  if  each  task  has  utilization  0.5  +  s,  then  only  m  tasks  (of  total  utilization 
approximately  m/2 )  can  be  scheduled  on  m  processors  because  only  one  task  can  be  assigned  to  a 
given  processor  without  overloading  it.  In  effect,  half  of  the  system’s  capacity  can  be  “lost.” 
Despite  this  limitation  of  partitioned  scheduling,  generally  speaking,  partitioned  algorithms  result 
in  less  capacity  loss  for  HRT  workloads  [10],  and  global  or  clustered  algorithms  result  in  less 
capacity  loss  for  SRT  workloads  [18,  33]. 

The  nature  of  the  capacity  loss  just  described  is  scheduler-related.  Additionally,  capacity  loss  can 
arise  due  to  system  overheads.  Such  overheads  include  activities  by  the  operating  system  (OS)  that 
take  away  processor  time  from  the  scheduled  tasks.  OS-related  overheads  include  processor  time 
lost  to  processing  interrupts,  making  scheduling  decisions,  enacting  context  switches,  etc.  These 
overheads  tend  to  be  reasonably  small  in  practice — overhead  values  of  a  few  microseconds  to  a 
few  tens  of  microseconds  are  common  [10].  In  contrast,  cache-related  overheads  also  must  be 
accounted  for,  and  these  can  be  larger,  on  the  order  of  tens  to  hundreds  of  microseconds  or  even  a 
few  milliseconds  [10].  Such  overheads  reflect  the  additional  cost  incurred  to  reload  previously 
cached  instructions  and  data  when  a  job  resumes  execution  after  a  preemption  or  migration.  The 
impacts  of  such  overheads  can  be  ameliorated  by  limiting  preemptions  or  migrations,  but  this  can 
cause  higher-priority  jobs  to  experience  priority  inversions,  i.e.,  they  may  be  temporally  blocked 
by  non-preemptive  lower-priority  jobs.  In  practice,  all  of  the  various  overheads  discussed  here 
must  be  accounted  for  in  schedulability  analysis.  Broadly  speaking,  this  involves  increasing  each 
task’s  WCET  by  an  amount  that  reflects  the  worst-case  overhead  one  of  its  jobs  may  experience. 

On  multicore  platforms,  the  presence  of  shared  hardware  such  as  caches,  memory  controllers, 
buses,  etc.,  can  cause  additional  capacity  loss.  In  particular,  when  task  WCETs  are  determined 
offline,  assumptions  regarding  contention  for  such  resources  must  be  made.  For  highly  critical 
HRT  tasks,  such  assumptions  will  tend  to  be  very  pessimistic.  For  example,  when  analyzing  such  a 
task’s  code,  if  a  data  reference  cannot  be  proved  with  certainty  to  result  in  a  cache  hit,  then  it  will 
be  assumed  that  such  a  reference  goes  to  memory.  With  complex  code,  it  can  be  very  difficult  to 
determine  which  memory  references  will  hit  in  some  cache.  As  a  result,  provisioned  task  WCETs 
may  be  much  larger  than  the  worst  case  that  can  be  seen  practice.  The  goal  of  our  work  on  shared 
cache  management  is  to  enable  shared  caches  on  multicore  platforms  to  be  more  predictably 
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managed  so  that  lower  WCETs  for  tasks  can  be  provisioned. 


Mixed-criticality  scheduling.  We  developed  our  shared  cache  management  framework  atop  a 
mixed-criticality  scheduler  developed  with  colleagues  from  NGC  under  AFOSR  support  [29,  46]. 
We  provide  a  brief  introduction  to  mixed-criticality  scheduling  here. 

In  many  embedded  systems,  the  severity  of  failure  is  not  the  same  for  all  tasks  in  the  system.  For 
example,  the  failure  of  one  task  may  cause  loss  of  life,  while  the  failure  of  a  different  task  may 
only  cause  degraded  system  performance.  Such  tasks  are  said  to  be  of  differing  criticalities.  The 
task  model  defined  at  the  beginning  of  this  subsection  is  agnostic  with  respect  to  criticalities:  under 
it,  each  task,  regardless  of  criticality,  is  provisioned  by  specifying  a  pessimistic  WCET.  A  system 
provisioned  in  this  way  may  be  reasonably  well-utilized  from  a  validation  and  certification 
perspective,  i.  e. ,  at  design  time,  but  be  severely  underutilized  in  practice,  i.  e. ,  at  run  time. 

A  technique  for  reclaiming  this  spare  capacity  has  been  proposed  by  Vestal  (who  was  working  in 
the  avionics  industry  at  Honeywell  at  the  time)  [55],  He  observed  that,  from  the  perspective  of 
scheduling  a  less  critical  task,  the  execution  times  assumed  of  more  critical  tasks  are  needlessly 
pessimistic.  Thus,  he  proposed  that  schedulability  tests  for  less  critical  tasks  be  altered  to 
incorporate  less  pessimistic  execution  times  for  more  critical  tasks — that  is,  per-criticality-level 
execution  times  are  assumed,  with  lesser  pessimism  for  lower  levels. 

More  formally,  in  a  system  with  L  criticality  levels,  L  system  variants  are  analyzed :  in  the  level-/ 
variant,  level-/  execution  times  are  assumed  for  all  tasks.  The  degree  of  pessimism  in  determining 
such  execution  times  is  level-dependent:  if  level  /  is  of  higher  criticality  than  level  /',  then  level-/ 
execution  times  will  generally  be  greater  than  level-//  execution  times.  For  example,  when 
certifying  a  system  at  the  highest  criticality  level,  provably  correct  upper  bounds  on  execution 
times  might  be  assumed  (as  computed,  for  example,  using  timing  analysis  tools  [59,  60]),  while 
when  certifying  at  a  lower  level,  observed  worst-case  times  from  profiling  might  be  assumed.  The 
task  model  resulting  from  Vestal’s  work  has  come  to  be  known  as  the  mixed-criticality  task  model. 
Assessing  correctness  in  a  criticality-cognizant  way  as  described  here  can  greatly  reduce  capacity 
loss,  particularly  when  relatively  few  higher-criticality  tasks  exist,  as  is  often  the  case  in  avionics 
systems. 

With  the  above  discussion  of  background  information  in  place,  we  now  describe  our  methods, 
assumptions,  and  procedures.  We  organize  this  description  as  follows:  first,  we  more  carefully 
describe  the  cache  management  problem  to  be  solved;  second,  we  describe  the  techniques  we 
developed  as  solutions  to  this  problem;  third,  we  describe  how  such  techniques  were  implemented. 

3.3  Cache  Management  Problem 

The  aforementioned  mixed-criticality  scheduler  devised  by  us  and  colleagues  at  NGC  is  called 
MC  (mixed-criticality  on  multicore)  [29,  46].  In  this  document,  we  consider  the  problem  of 
adding  proper  shared  cache  management  to  MC  when  supporting  periodic  tasks. 

In  MC2,  four  criticality  levels  exist,  denoted  A  (highest)  through  D  (lowest),  as  shown  in 
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Figure  1:  Scheduling  under  MC2  on  a  four-processor  system. 

Figure  1 .  Higher  criticality  tasks  are  statically  prioritized  over  lower  criticality  ones.  Level-A  tasks 
are  partitioned  and  scheduled  on  each  processor  using  a  table-driven  cyclic  executive  [4],  Level-B 
tasks  are  also  partitioned  but  are  scheduled  using  a  rate-monotonic  (RM)  scheduler  on  each 
processor.  Level-A  and  -B  tasks  are  required  to  be  simply  periodic  (all  tasks  commence  execution 
at  time  0  and  periods  are  harmonic).  Level-C  tasks  are  scheduled  via  a  G-EDF  scheduler.  Level-D 
tasks  are  scheduled  with  no  real-time  guarantees  on  a  best-effort  basis  (so  we  do  not  consider  them 
further).  A  task’s  execution  time  at  its  own  criticality  level  is  treated  as  an  OS  enforced  execution 
budget:  if  a  job  of  a  task  7)  has  an  execution  time  exceeding  T/s  budget,  then  more  than  one 
budget  allocation  will  be  required  to  service  it.  Level-A  and  -B  tasks  are  HRT,  while  level-C  tasks 
are  SRT.  Most  interesting  cache-related  issues  are  exposed  by  focusing  only  on  levels  B  and  C 
(one  HRT  level  and  one  SRT  level).  In  the  current  prototype  of  our  cache  management  framework, 
only  these  two  levels  are  assumed  to  be  present,  so  we  make  this  assumption  hereafter.  We  note 
that  systems  with  two  criticality  levels  have  been  the  predominate  focus  of  prior  work  on  mixed- 
criticality  scheduling  [6,  7,  19]. 

Page  coloring.  All  of  our  proposed  cache  management  schemes  utilize  page  coloring  in  some  way. 
We  now  describe  page  coloring  with  respect  to  an  NVIDIA  Tegra  T30  quad-core  ARM  Cortex  A9 
system,  which  is  the  machine  used  in  our  implementation-oriented  work. 

The  ARM  platform  used  in  our  experiments  has  four  cores  that  share  an  L2  cache.  In  this 
document,  we  consider  page  coloring  with  respect  to  this  cache.  The  L2  cache  on  this  platform  is  a 
1  MB  8-way  set  associative  cache:  it  stores  contents  of  physical  memory  in  32  B  units  called 
“lines,”  each  line  of  physical  memory  maps  to  a  particular  cache  “set,”  each  such  set  can  store  8 
lines  (equivalently,  there  are  eight  “ways”  per  set),  and  in  total  there  are  2  sets.  The  physical 
memory  of  this  platform  is  subdivided  into  4  KB  pages.  To  envision  the  coloring  process,  con¬ 
sider  each  page  in  sequence.  For  the  first  page  in  memory,  assign  the  color  “0”  to  it,  and  assign 
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the  same  color  to  the  cache  lines  to  which  its  contents  map.  Then,  since  each  page  consists  of 
4  KB/(32  B/line)  =  128  lines,  sets  1  -  128  are  assigned  color  0.  Repeat  this  process,  assigning 
color  1  (mapping  to  sets  129  -  256)  to  the  second  page  in  memory,  color  2  (sets  257  -  384)  to  the 
third  page,  and  so  on.  Then,  after  the  32nd  page,  all  2 12  sets  will  have  been  used  and  color 
assignments  will  “wrap,”  i.e.,  the  33  page  will  map  to  the  same  cache  sets  as  the  first,  so  we  reuse 
color  0  for  it.  Continuing  this  process,  each  page  will  be  assigned  to  one  of  32  colors.  Moreover, 
two  pages  that  are  assigned  different  colors  will  map  to  different  cache  sets  and  thus  cannot 
conflict  with  each  other  in  the  cache. 

In  Section  3.4,  we  consider  techniques  that  exploit  page  coloring  to  eliminate  or  control  cache 
conflicts.  In  discussing  these  techniques,  we  limit  attention  to  non-shared  task  data  pages,  as  only 
these  pages  are  managed  in  the  prototype  system  described  in  Section  3.6.  We  define  the  working 
set  size  ( WSS)  of  a  task  to  be  the  size  (in  bytes)  of  the  set  of  data  pages  it  may  access  in  one  job, 
i.e.,  the  size  of  its  per-job  working  set  ( WS ). 

Multiprocessor  real-time  locking.  Some  of  the  cache  management  schemes  we  consider  utilize 
multiprocessor  real-time  locking  protocols.  In  the  protocols  we  consider,  tasks  wait  by  suspending 
execution.  Locking  protocols  must  ensure  that  priority  inversion  blocking  (pi-blocking )  can  be 
analytically  bounded.  Pi-blocking  is  the  duration  of  time  a  job  is  blocked  while  a  lower-priority 
job  is  running.  Per- task  bounds  on  pi-blocking  are  required  when  analyzing  schedulability.  We  let 
bi  denote  the  pi -blocking  bound  for  task  T,-. 

On  a  multiprocessor  system,  the  actual  definition  of  pi-blocking  depends  on  how  schedulability 
analysis  is  done  [10].  For  some  schedulers,  suspensions  are  notoriously  difficult  to  analyze,  so 
working  set  size )  analysis  is  applied:  jobs  may  suspend,  but  each  ez-  must  be  analytically  inflated  by 
bt  prior  to  applying  a  schedulablity  test  to  account  for  lock-related  delays.  We  utilize  s-oblivious 
analysis  in  our  work.  Some  of  the  nuances  of  such  analysis  can  best  be  explained  by  comparing  it 
to  suspension-aware  (s-aware)  analysis,  which  explicitly  accounts  for  b,-  and  is  available  for  some 
schedulers. 

Since  suspended  jobs  are  counted  as  demand  under  s-oblivious  analysis,  the  mere  presence  of  m 
higher-priority  jobs  rule  out  a  priority  inversion,  whereas  only  ready  higher-priority  jobs  can 
nullify  a  priority  inversion  under  s-aware  analysis.  Accordingly,  under  s-oblivious  (respectively, 
s-aware)  schedulability  analysis,  a  job  J,  incurs  s-oblivious  (respectively,  s-aware )  pi-blocking  at 
time  t  if  J,  is  pending  but  not  scheduled  and  fewer  than  m  higher-priority  jobs  are  pending 
(respectively,  ready).  This  is  illustrated  in  Figure  2.  Prior  research  has  shown  that  s-aware  and  s- 
oblivious  analysis  are  comparable  in  terms  of  schedulability  achievable  in  practice  [10]. 

Cache-related  locking  problem.  We  now  describe  the  basic  synchronization  problem  that  arises 
when  using  locking  protocols  for  cache  management  (protocol-specific  details  are  discussed  in 
Section  3.4).  When  using  such  protocols,  each  color  is  viewed  as  a  shared  resource  that  has  a 
number  of  “replicas”  as  given  by  the  number  of  cache  ways,  as  illustrated  in  Figure  3.  Before  a  job 
commences  execution,  it  must  first  lock  a  replica  of  each  color  that  it  requires  (as  given  by  the 
pages  it  will  access).  If  the  job  accesses  r  pages  with  the  same  color,  then  it  must  lock  r  replicas  of 
that  color.  The  needed  synchronization  protocol  must  enable  a  set  of  shared  resources 
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Figure  2:  Example  (from  [10])  of  s-oblivious  and  s-aware  pi-blocking.  G-EDF  scheduling  on  two  processors  is 
assumed.  J3  suspends  during  [1,  3),  and  since  no  higher-priority  jobs  exist  it  is  pi-blocked  under  either 
definition.  J\,  suspended  during  [2,  4),  suffers  pi-blocking  under  either  definition  during  [3,  4)  since 
it  is  among  the  m  highest-priority  pending  jobs,  but  only  s-aware  pi-blocking  during  [2,  3)  as  J3  is 

pending  but  not  ready  then. 

to  be  managed,  where  each  resource  has  multiple  replicas,  and  jobs  may  need  to  lock  several 
replicas  simultaneously.  In  actuality,  such  a  protocol  is  utilized  by  the  OS  when  making  scheduling 
decisions,  i.e.,  the  jobs  themselves  do  not  acquire  and  release  color-related  locks.  This  means  that 
the  OS  must  know  the  pages  a  job  will  access  prior  to  making  a  scheduling  decision. 

3.4  Cache  Management  Techniques 

In  this  section,  we  present  techniques  for  managing  accesses  to  a  shared  cache  by  level-B  tasks  in 
MC  (recall  our  assumption  that  level  A  is  not  present).  Our  goal  is  to  enable  WCET  reductions 
and  greater  predictability  in  such  tasks.  We  do  not  apply  these  cache  management  techniques  to 
level-C  tasks  as  we  expect  the  shared  cache  to  provide  reasonable  average-case  performance 
(which  should  be  acceptable  for  SRT  tasks).  We  assume  that  the  last  level  of  the  shared  cache  is 
the  one  that  is  managed  ( e.g .,  the  L2  cache  of  our  ARM  platform).  Furthermore,  we  assume  that 
tasks  of  different  criticality  levels  are  partitioned  in  the  cache,  i.e.,  they  are  allocated  such  that  they 
do  not  share  colors  (support  for  such  allocation  is  described  in  Section  3.6).  The  term  “task”  is 
used  to  refer  to  a  level-B  task  in  the  rest  of  this  section. 

We  begin  by  broadly  describing  color  management  strategies  and  our  assumptions.  First,  we 
assume  an  inclusive  write-back  cache  for  which  the  OS  can  precisely  control  which  cache  sets  and 
ways  a  task’s  data  is  loaded  into.  Second,  we  assume  that  the  OS  has  access  to  hardware 
mechanisms  that  enable  ways  of  the  cache  to  be  locked  (and  later  unlocked)  such  hat  data  in  a 
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Figure  3:  Color  locking. 


locked  way  is  protected  from  being  evicted  (equivalently,  the  way  is  unavailable  for  allocation). 
Third,  we  assume  that  the  physical  memory  pages,  and  therefore  the  color  requirements  of  each 
task,  have  been  previously  assigned,  and  that  tasks  have  been  partitioned  onto  processors.  Finally, 
we  assume  that  the  memory  a  task  uses  is  pre-allocated  before  the  task  system  begins  execution,  as 
is  common  in  real-time  systems.  We  discuss  the  realization  of  the  first  two  points  on  our  test 
platform  in  Section  3.6.  In  Section  4.2,  we  address  the  third  requirement  by  describing  methods  to 
assign  colors. 

The  effect  of  these  assumptions  is  that  once  data  is  loaded  into  a  cache  line  and  it  is  locked,  all 
subsequent  memory  accesses  to  that  data  will  hit  in  the  cache.  With  this  property,  uniprocessor 
timing  analysis  tools,  which  are  much  less  pessimistic  than  multicore  ones,  can  be  employed  to 
more  accurately  estimate  WCETs  for  level-B  tasks.  Furthermore,  observed  WCETs  will  be 
improved  as  the  number  of  cache  misses  is  provably  reduced. 

Under  these  assumptions  the  cache  can  be  treated  as  either  a  non-preemptive  or  a  preemptive 
resource.  This  gives  rise  to  four  different  classes  of  allocation  policies,  depending  upon  whether 
the  processor  or  the  cache  is  preemptive.  We  claim,  however,  that  either  both  should  be 
preemptive,  or  both  should  be  non-preemptive.  If  the  processor  is  non-preemptive  and  the  cache  is 
preemptive,  then  it  is  possible  for  a  job  to  be  scheduled  while  its  needed  cache  colors  have  been 
preempted  ( i.e .,  are  not  available  for  it  to  use).  Alternatively,  if  the  cache  is  non-preemptive  and 
the  processor  is  preemptive,  then  colors  could  be  locked  by  a  job  that  is  later  preempted  and  thus 
unable  to  use  them.  We  therefore  consider  only  the  non-preemptive  case,  which  we  call  cache 
locking,  and  the  preemptive  case,  which  we  call  cache  scheduling. 
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Figure  4:  Illustration  of  the  RNLP  queue  structure. 


3.4.1  Cache  Locking 

Under  cache  locking,  a  job  must  hold  a  color  lock  for  all  of  its  needed  colors  before  execution, 
and  it  does  not  release  its  color  locks  until  the  end  of  its  execution,  i.e.,  its  execution  requirement 
is  a  critical  section.  This  ensures  cache  isolation  for  each  job  during  the  entirety  of  its  execution. 
This  policy  can  be  realized  by  using  a  multiprocessor  real-time  locking  protocol  to  arbitrate 
access  to  colors  and  treating  each  job’s  execution  time  as  a  critical  section. 

For  this  purpose,  we  leverage  the  Real-Time  Nested  Locking  Protocol  (RNLP)  [56],  a 
recently  developed  multiprocessor  real-time  locking  protocol  that  optimally  supports  the 
simultaneous  locking  of  multiple  resources.  The  RNLP  controls  access  to  all  cache  colors  and  their 
respective  ways.  For  each  way  of  each  color,  there  is  an  associated  FIFO-ordered  way  queue  of 
jobs.  This  architecture  is  depicted  in  Figure  4.  The  head  of  each  way  queue  is  assumed  to  have 
acquired  the  associated  way,  though  it  does  not  execute  until  it  has  acquired  all  needed  ways.  A 
job  Jt  atomically  requests  all  colors  it  requires  before  it  can  commence  execution.  For  each  color  c 
from  which  Jz  requests  rc  ways,  is  enqueued  in  the  shortest  rc  way  queues  associated  with  c.  A 
job  in  a  way  queue  that  is  either  waiting  for  a  resource  or  scheduled  with  its  needed  ways  is 
considered  non-preemptive.  Therefore,  no  other  jobs  on  the  processor  can  be  either  scheduled  or 
enqueued  in  the  way  queues.  Because  there  can  be  at  most  m  jobs  total  in  all  way  queues,  and 
because  jobs  enqueue  in  the  shortest  way  queues,  the  maximum  duration  of  blocking  for  all  cache 
colors  is  0(mr/k)  where  k  is  the  number  of  ways  available  and  r  is  the  maximum  number  of  ways 
per  color  requested  by  any  job.  Under  s-oblivious  analysis,  a  non-preemptive  job  is  analytically 
treated  as  scheduled,  even  when  it  is  not  actually  scheduled.  This  non-preemptive  job  can  cause 
0(mr/k )  non-preemptive  blocking  for  other  jobs.  Thus,  the  total  duration  of  blocking  is  0(mr/k). 

Example  1.  Consider  a  two-processor  system,  with  two  tasks  on  each  processor,  as  depicted  in 
Figure  5.  For  simplicity,  assume  that  the  cache  is  direct-mapped  {i.e.,  only  one  way  per  color). 
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Figure  5:  Example  of  the  trade-off  between  priority  inversion  and  improved  WCET  (explained  in  text). 

Assume  that  jobs  A  and  A  (respectively,  A  and  A)  are  partitioned  onto  processor  P\  (respectively, 
P2).  Also,  assume  the  lower-indexed  jobs  have  higher  priority.  Let  jobs  A  and  A  share  blue 
(striped  in  the  figure),  and  jobs  A  and  A  share  red  (solid).  If  all  jobs  are  released  synchronously 
on  both  processors,  then  without  cache  locking,  both  jobs  A  and  A  would  be  concurrently 
scheduled,  and  would  conflict  in  the  cache.  However,  as  shown  in  Figure  5,  A  is  concurrently 
scheduled  with  A-  These  jobs  do  not  conflict  in  the  cache.  When  A  completes,  A  and  A  can 
execute  without  conflicting  in  the  cache.  In  this  example,  there  is  a  priority  inversion,  when  job  A 
cannot  run,  and  instead  the  lower-priority  job  A  executes.  Our  goal  is  to  show  that  the  effect  of 
such  priority  inversions  is  offset  by  the  improved  WCETs  afforded  by  cache  isolation. 

Period  splitting.  Non-preemptive  processor  scheduling,  a  byproduct  of  our  locking  protocol,  can 
cause  pi-blocking  that  can  be  detrimental  to  the  schedulability  of  a  task  set. 

Example  2.  Consider  a  system  where  A  has  a  higher  priority  than  Tj.  If  eJ  ^  Pi,  then  the  system 
may  be  unschedulable  because  Tj  might  be  non-preemptive  throughout  the  entirety  of  Tf  s  period, 
causing  Tj  to  miss  its  deadline. 

The  undesirable  effects  of  non-preemptivity  can  be  ameliorated  by  exploiting  the  fact  that  cache- 
related  critical  sections  are  not  “true”  critical  sections:  they  can  be  preempted,  albeit  with  an 
additional  cost  to  reload  evicted  cache  lines  that  are  later  required.  Under  s-oblivious  analysis 
(recall  from  Section  3.3),  non-preemptive  blocking  can  be  eliminated  using  a  technique  called 
period  splitting.  Under  period  splitting,  each  task’s  period  is  set  to  the  shortest  period  in  the  system 
and  its  execution  time  is  scaled  accordingly.  Thus,  each  job  actually  executes  as  a  sequence  of 
subjobs. 

Example  3.  Consider  a  system  with  two  tasks,  Tx  and  T2,  where  e\  =  1,  p\  =  3,  e2  =  6,  and 
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p2  =  9.  Under  period  splitting,  T2 s  period  is  reduced  to  match  p\,  without  altering  its  utilization. 
Thus,  after  period  splitting,  e2  =  2  and  p2  =  3. 

Since  level  B  in  MC  is  simply  periodic,  period  splitting  ensures  that  subjobs  are  always  re-  leased 
in  phase.  Under  this  assumption,  the  RNLP  guarantees  that  all  s-oblivious  pi-blocking  is  caused  by 
tasks  on  remote  processors. 

Job  splitting.  Such  remote  blocking  can  clearly  cause  the  system  to  be  unschedulable.  It  is 
particularly  detrimental  if  critical-section  lengths  are  highly  variant. 

Example  4.  Consider  a  system  in  which  a  color  is  shared  between  two  tasks  7j  and  T2,  where 
e\  =  1  and P\  =  e2  =  16.  If  7j  blocks  for  16  time  units  on  T2,  then  it  will  miss  its  deadline. 

We  can  mitigate  such  detrimental  blocking  by  breaking  each  task  into  multiple  subtasks  that  are 
scheduled  on  the  same  processor  and  that  all  have  the  same  period,  but  smaller  execution  times, 
such  that  all  subtask  utilizations  sum  to  the  original  task’s  utilization.  This  is  similar  to  inserting 
preemption  points  into  jobs  of  a  task,  but  is  implemented  by  enforcing  budgets.  We  call  this 
technique  job  splitting.  Note  the  difference  between  job  and  period  splitting:  under  job  splitting, 
tasks  are  broken  into  many  subtasks  with  smaller  utilizations,  while  under  period  splitting,  a  task’s 
period  is  reduced,  while  maintaining  its  utilization. 

Example  4  (cont’d).  If  T2  is  split  into  16  subtasks,  each  with  an  execution  cost  of  one,  then  worst- 
case  blocking  for  7j  is  improved  from  16  to  one. 

Job  and  period  splitting,  which  can  potentially  be  applied  together,  can  reduce  pi-blocking  bounds 
by  ameliorating  the  adverse  effects  of  highly  variant  critical-section  lengths.  However,  under 
either  scheme,  there  is  additional  overhead  for  each  subjob,  such  as  scheduling  decisions  and 
reloading  cache  lines.  Such  overheads  could  be  prohibitive  if  splitting  is  too  “fine-grained.” 

3.4.2  Cache  Scheduling 

While  non-preemptively  scheduling  tasks  with  respect  to  the  cache  maximizes  reuse  of  cache  lines 
within  each  job,  it  can  cause  adverse  blocking,  as  we  have  seen.  Alternatively,  we  can  treat  the 
cache  as  a  preemptive  resource ,  a  technique  we  call  cache  scheduling.  In  this  case,  color  replicas 
are  preemptively  “scheduled.”  When  a  task  is  scheduled  with  respect  to  a  set  of  color  replicas,  it 
has  exclusive  access  to  those  replicas  and  thus  will  not  experience  cache  conflicts  with  tasks  on 
remote  processors.  However,  it  may  be  preempted  to  allow  a  higher-priority  task  to  access  some 
replica.  Similar  to  job  and  period  slicing,  such  a  preemption  may  force  the  preempted  task  to 
reload  its  WS.  However,  the  cost  of  such  reloads  can  be  analyzed  and  incorporated  into 
schedulability  tests  using  existing  techniques  as  described  in  [10,  Chapter  3], 

Example  5.  Consider  the  two-processor  schedule  depicted  in  Figure  6,  where  each  processor, 
P\  and  P2,  has  two  assigned  tasks  with  utilization  1/4.  Tasks  7j  and  T3  are  defined  by  (eu  p\)  = 
(e3,p3)  =  (1,  4)  and  both  share  red  and  green.  Task  T2  is  defined  by  (e2,  p2)  =  (2,  8)  and  requires 
green  only,  while  T4  is  defined  by  (e4,  p4)  =  (4,  16)  and  requires  red  only.  RM  priorities  are 
applied  to  tasks  on  both  the  cache  and  their  r  espective  processors.  Thus,  tasks  7)  and  7/  are 
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Figure  6:  Preemptively  scheduling  the  cache. 

scheduled  before  T2  and  T4.  At  time  t  =  4,  T4  is  preempted  by  the  higher-priority  task  Tx  on 
processor  Px . 

We  believe  that  by  viewing  cache  colors  as  preemptive  processors  we  expose  a  new  scheduling 
problem.  The  problem  is  how  to  schedule  and  verify  the  schedulability  of  a  task  system  in  which 
each  task  Tf  has  an  additional  parameter,  R,-,  which  gives  a  set  of  processor  requirements.  A 
processor  requirement  (a,  5)  e  R,  specifies  that  for  T,-  to  execute,  it  must  be  scheduled  on  a 
processors  from  the  set  of  processors  5. 

Example  6.  For  a  task  T,-  assigned  to  processor  P2  requiring  four  replicas  of  color  blue,  of  which 
there  are  16  ways,  Bu...,  Bl6,  we  have  Rf  =  {(1,  {P2}),  (4,  {Bh 516})}. 

This  scheduling  problem  is  a  generalization  of  gang  scheduling  [30],  and  scheduling  with 
processing  set  restrictions  [35],  We  leave  this  general  problem  for  future  work,  but  present  more 
simplistic  analysis  by  reducing  the  problem  to  a  uniprocessor  scheduling  problem. 

Reduction  to  uniprocessor  analysis.  Fundamental  to  our  analysis  is  the  observation  that  there  are 
task  systems  for  which  the  processor  may  not  be  the  constrained  resource — instead  the  cache  may 
be  overutilized,  causing  the  task  system  to  be  unschedulable. 

Example  7.  Consider  two  tasks  that  share  the  same  set  of  colors  and  have  a  total  utilization  greater 
than  one.  The  tasks  may  be  partitioned  onto  either  one  or  two  processors.  In  the  former  case,  the 
processor  is  overutilized,  and  the  task  system  is  unschedulable.  In  the  latter  case,  the  tasks  must 
run  sequentially  to  ensure  cache  isolation,  and  thus  the  task  system  is  unschedulable  despite  the 
additional  processing  capacity  of  the  second  processor.  This  is  because  the  cache  is  the  constrained 
resource,  not  the  processors. 

We  next  formalize  the  reduction  to  uniprocessor  analysis  mentioned  above  and  demonstrate  how 
the  utilization  of  cache  colors  can  be  evaluated  to  check  schedulability  with  respect  to  both  cache 
colors  and  processors. 


We  define  a  binary  relation  for  direct  contention  D,  such  that  two  tasks  are  related  if  they  share 
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colors  or  a  processor, 


D  =  \(T,;  Tj)  e  z2  |  C,  fl  Cj  ^0\!P{Ti)  =  P(Tj)}, 

where  Cz  is  the  set  of  colors  that  task  Tt  requires,  and  P  ( Tt )  is  the  processor  on  which  7)  is 
partitioned.  Let  D+  be  the  transitive  closure  of  direct  contention,  D.  We  define  each  equivalence 
class  in  D+  to  be  a  logical  cache  processor.  By  definition,  all  tasks  on  the  same  physical  processor 
must  be  on  the  same  cache  processor,  and  thus  the  number  of  cache  processors  is  at  most  m. 

We  evaluate  the  schedulability  of  each  cache  processor  as  a  uniprocessor,  and  apply  known 
schedulability  tests.  In  MC  ,  which  uses  RM  priorities  on  each  physical  processor  and  simply 
periodic  periods,  the  schedulability  test  for  a  physical  processor  Pj  is  Er,  on  pj  «,  >  1.  We  can  also 
schedule  the  cache  using  RM  priorities,  and  allow  new  jobs  of  high-priority  tasks  to  preempt  low- 
priority  tasks  with  respect  to  the  cache.  Thus,  the  resulting  schedulability  test  is  the  same:  the  total 
utilization  of  each  cache  processor  is  at  most  one.  We  do  not  need  to  explicitly  evaluate  the 
schedulability  of  the  physical  processors,  since  the  set  of  tasks  on  each  physical  processor  is  a 
subset  of  the  tasks  on  the  respective  cache  processor.  Thus,  if  each  cache  processor  is  schedulable, 
then  each  physical  processor  is  as  well. 

Example  5  (cont’d).  Because  tasks  on  processors  Pi  and  P2  share  the  same  set  of  colors  (red  and 
green),  they  form  a  single  cache  processor.  Since  the  utilization  of  the  four  tasks  is  one,  both  the 
cache  and  the  processors  are  schedulable.  If  all  tasks  required  both  red  and  green,  then  the  tasks 
would  necessarily  be  serialized  as  if  they  were  on  a  single  processor  (assuming  cache  isolation  is 
required).  However,  because  tasks  73  and  T\  do  not  share  the  same  colors,  they  can  execute 
concurrently,  as  depicted  in  Figure  6. 

Note  that  this  schedulability  condition  is  only  sufficient  and  may  be  pessimistic:  it  may  be 
possible  for  tasks  to  run  concurrently  on  different  physical  processors  if  they  require  disjoint 
colors.  This  pessimism  can  be  avoided  in  two  ways:  through  tighter  analysis  of  the  aforementioned 
more  general  scheduling  problem,  or  by  assigning  colors  to  tasks  in  a  way  that  reflects  our 
uniprocessor  analysis,  as  discussed  later. 

3.5  Related  Work 

Prior  work  exists  on  cache  management  that  is  similar  to  ours  in  some  respects.  An  approach 
called  cache  lockdown  (not  to  be  confused  with  our  use  of  the  term  “locking”)  has  been  proposed 
wherein  designated  cache  lines  are  “locked  down”  in  the  cache  so  that  they  cannot  be  evicted 
[16].  Similarly,  an  approach  called  cache  partitioning  has  been  proposed  that  attempts  to  mitigate 
the  impact  of  cache  conflicts  by  allocating  sections  (or  partitions)  of  the  cache  to  specific  tasks 
(see  [32]  for  an  overview).  Cache  partitioning  can  be  done  automatically  by  the  compiler  [47],  but 
the  source  code  of  programs  must  be  available  for  compilation  and  large  portions  of  memory  must 
be  allocated  as  padding  to  achieve  the  desired  code  and  data  placement.  To  remedy  this, 
partitioning  at  the  OS  level  was  proposed  [36],  This  approach  can  be  applied  dynamically, 
transparently,  and  without  access  to  application  source  code.  However,  it  may  be  difficult  to  size 
partitions  so  that  the  cache  is  efficiently  utilized  from  a  system-wide  perspective. 
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In  general,  the  problems  of  optimally  assigning  tasks  to  processors  and  colors  to  tasks  are  both 
NP-hard  in  the  strong  sense  [14],  though  suboptimal  heuristic -based  algorithms  have  been 
proposed  and  evaluated  on  different  hardware  platforms  [26,  48,  50].  Cache  partitioning  is  actually 
a  special  case  of  our  cache  locking  approach  in  which  inter-task  cache  conflicts  are  entirely 
eliminated,  and  hence  the  usage  of  a  locking  protocol  is  obviated.  However,  our  generalization 
allows  more  flexibility  in  determining  color  assignments  and  can  enable  greater  dynamism  at 
runtime. 

A  system  execution  model  called  PREM  [51]  has  been  proposed  that  takes  an  approach  similar  to 
ours.  Specifically,  PREM  uses  scheduling  to  reduce  or  eliminate  contention  for  shared  resource 
accesses,  including  main  memory.  However,  unlike  our  proposed  cache  management  techniques, 
PREM  is  restricted  to  single-core  systems.  An  extension  to  PREM  [61]  examines  memory-centric 
Time  Division  Multiple  Access  (TDMA)-based  scheduling  on  multicore  processors,  but  assumes 
no  shared  resources  among  cores  except  memory — each  core’s  last-level  cache  is  private. 

In  work  on  timing  analysis,  methods  pertaining  to  memory  hierarchies  have  been  proposed  ( e.g ., 
see  [34,  52]  and  the  references  therein).  Related  hardware-based  techniques  include  cache  bypass 

[28] ,  which  reduces  cache  conflicts  by  exploiting  special  hardware  instructions,  and  methods  for 
making  multicore  platforms  more  predictable  at  the  processor  [49]  and  interconnect  levels  [3,  27, 
53,  54],  In  contrast  to  this  work,  our  approaches  are  software-based. 

3.6  Implementation 

We  implemented  the  cache  management  techniques  described  above  in  an  MC2  prototype  that  was 
itself  implemented  within  a  real-time  extension  of  Linux  called  LITMUSrt  (Linux  Testbed  for 
Multiprocessor  Scheduling  in  Real-Time  systems)  [1].  LITMUSrt  was  developed  previously  by 
our  group  under  ARO  and  NSF  support.  LITMUSrt  extends  Linux  (currently,  version  3.0)  by 
allowing  different  (multiprocessor)  scheduling  algorithms  and  synchronization  protocols  to  be 
linked  as  plug-in  components.  Being  based  on  Linux,  LITMUSrt  is  not  a  real-time  OS  (RTOS) 
that  could  conceivably  be  used  in  safety  critical  domains.  However,  we  chose  it  as  our 
development  platform  because  it  is  open  source.  This  choice  of  platform  should  not  be  a 
limitation:  any  resource  allocation  principles  identified  in  our  work  should  be  easily  transferable  to 
other  RTOSs. 

In  developing  our  implementation,  we  restricted  attention  to  tasks  that  are  independent  (no  shared 
resources  other  than  cache  lines),  have  only  memory-resident  pages,  and  that  do  not  share  pages 
with  each  other.  We  used  the  ARM  machine  described  in  Section  3.3  as  our  development  platform. 
All  page  coloring  was  done  with  respect  to  the  last  level  of  cache,  which  on  this  machine  is  its  L2. 
In  prior  work,  we  found  that  level-B  schedulability  is  greatly  improved  if  one  CPU  is  designated 
as  a  release  master  that  processes  all  job-release  interrupts  and  is  not  assigned  any  level-B  tasks 

[29] .  We  employed  a  release  master  in  our  current  prototype,  so  level-B  tasks  are  actually  only 
scheduled  on  three  CPUs  (level-C  tasks  can  execute  on  the  release  master). 

As  noted  earlier,  we  implemented  coloring  with  respect  to  tasks’  data  pages.  We  leave  the 
coloring  of  other  areas  of  memory  as  future  work.  We  believe  this  is  reasonable  for  a  first 
prototype,  as  our  implemented  tasks  have  a  small  per-job  code  footprint  that  does  not  make  any 
system  calls  and  operate  only  on  the  colored  memory  they  allocate.  We  implemented  a  memory 
allocation  function  similar  to  malloc  that  modifies  the  page  tables  of  the  backing  user  process 
for  each  task  to  map  pages  with  the  proper  colors  into  its  contiguous  virtual  address  space. 

Controlling  eviction.  When  discussing  various  page-coloring-oriented  cache  management 
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techniques  in  Section  3.3,  we  assumed  that  the  OS  can  precisely  control  the  cache  sets  and  ways 
that  a  task’s  data  is  loaded  into.  In  our  implementation,  this  functionality  is  achieved  by  exploiting 
cache  lockdown  using  a  clever  method  proposed  by  Mancuso,  et  al.  [43],  As  noted  earlier,  cache 
lockdown  allows  certain  ways  to  be  marked  as  unavailable  for  allocation  (or  locked  down)  such 
that  the  contents  of  a  locked  way  cannot  be  evicted. 

The  approach  in  [43]  utilizes  a  variant  of  cache  lockdown  called  Lockdown  by  Master  ( LbM ), 
where  each  CPU  Pq  has  access  to  a  per-CPU  lockdown  register  xq  such  that  bit  i  in  xq  is  zero  if 
allocation  can  occur  in  way  i  for  memory  references  from  CPU  Pq,  and  one  otherwise.  LbM  is 
supported  on  our  ARM  platform.  As  noted  by  the  authors  of  [43],  setting  all  but  one  bit  of  rc  to 
one  pigeonholes  memory  requests  from  CPU  Pq  to  be  allocated  in  a  specific  cache  way.  By 
applying  this  idea,  memory  can  be  prefetched  by  reading  it  in  a  “prefetch”  loop  such  that  the  loop 
code  occupies  at  most  one  cache  line  and  is  cache-line  aligned.  During  prefetching,  CPU-local 
interrupts  must  be  disabled  to  avoid  interrupt-related  cache  pollution.  To  ensure  that  each 
prefetched  cache  line  is  read  into  the  proper  way,  none  of  the  memory  to  be  prefetched  can  be 
cached  elsewhere.  Also,  if  a  CPU  is  prefetching  into  way  w,  then  the  other  CPUs  cannot  have  w 
unlocked  unless  they  access  memory  that  is  of  a  different  color. 

4  RESULTS  AND  DISCUSSION 

In  this  section,  we  describe  the  results  of  an  evaluation  that  we  conducted  to  assess  the  efficacy  of 
our  cache  management  framework.  In  Section  4.1,  we  describe  our  evaluation  methodology.  In 
Section  4.2,  we  describe  our  experimental  results. 

4.1  Evaluation 

We  evaluated  our  implementation  by  measuring  system  overheads  (including  cache  prefetching 
and  flushing  costs)  and  observed  WCET  improvements.  To  obtain  these  measurements,  we  traced 
the  behavior  of  task  sets  with  varying  task  counts,  where  periods  and  utilizations  were  generated 
uniformly  from  {25,  50,  100,  200}ms  and  [0.01,  0.05],  respectively.  Level-B  tasks  were  assigned 
to  cores  using  the  worst- fit  heuristic.  (Any  task  set  that  could  not  be  so  assigned  was  discarded.) 
Cache  prefetching  was  turned  off.  Each  task  was  defined  via  a  per-job  code  sequence  in  which  it 
reads  its  WS  in  a  random  order  one  or  more  times. 

Scaling  factors.  A  job  executed  under  cache  locking  or  cache  scheduling  does  not  suffer  cache 
misses  and  therefore  completes  execution  earlier  than  it  would  in  a  system  without  cache 
management.  We  use  scaling  factors  to  denote  the  ratio  of  per-job  execution  times  without  and 
with  cache  management,  e.g.,  if  cache  scheduling  reduces  a  task’s  per-job  execution  time  from  10 
ms  to  2  ms,  then  this  task  has  a  scaling  factor  of  five.  The  worst-case  scaling  factors  on  our  test 
platform  were  on  the  order  of  3.5  to  4.5,  while  average-case  scaling  factors  were  between  3.3  and 
3.9. 

In  Section  4.2,  we  use  these  scaling  factors  to  compare  schedulability  with  and  without  cache 
management  in  the  presence  of  system  overheads.  While  it  might  be  preferable  to  use  WCETs 
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Figure  7:  Each  graph  shows  average-case  (orange,  textured)  and  worst-case  (blue,  solid)  measurements. 

(a)  Scheduling  overheads,  (b)  Prefetch  and  flush  overheads  for  various  WSSs. 

predicted  by  timing  analysis  tools  in  such  comparisons,  adequate  tools  for  multicore  platforms  do 
not  yet  exist.  Also,  note  that  observed  WCETs  lower  bound  predicted  ones  (if  the  prediction  is 
safe).  Thus,  observed  values  give  some  indication  of  how  a  tool  might  perform  given  varying 
degrees  of  information  about  cross-core  cache  interactions. 

System  overheads.  Cache  management  increases  scheduling  costs  and  consequently  the 
duration  of  OS  overheads  interrupting  task  execution.  Additional  overhead  arises  under  cache 
locking  due  to  locking-protocol  overhead,  and  under  cache  scheduling  due  to  added  complexity  for 
maintaining  cache  processor  state.  Figure  7(a)  depicts  scheduling  overheads  ( i.e .,  the  time  taken  by 
the  OS  to  make  a  scheduling  decision)  for  level  B  assuming  no  cache  management  ( Unmanaged ), 
cache  locking  {Locking),  and  cache  scheduling  {Scheduling).  Observe  that  worst-case  overheads 
were  increased  by  around  60  jus  for  cache  scheduling  and  50  jus  for  cache  locking. 

The  cache  lockdown  prefetching  operation  must  also  be  considered.  A  task  will  not  begin 
execution  until  its  assigned  processor  has  cached  the  entirety  of  its  WS.  In  Figure  7(b),  we  show 
this  total  prefetching  cost  for  different  WSSs.  The  worst-case  cost  varied  from  400  jus  to  900  jus  in 
our  experiments,  depending  on  WSS.  With  respect  to  schedulability,  these  costs  and  higher 
scheduling  overheads  are  offset  by  lower  WCETs  as  we  show  next. 

4.2  Schedulability  Study 

In  this  section,  we  evaluate  the  utility  of  our  proposed  cache  management  techniques  from  a 
schedulability  perspective,  with  measured  overheads  considered,  and  examine  the  tradeoff  between 
improved  WCETs  enabled  by  our  techniques  and  any  utilization  loss  due  to  cache  management. 

Utilization  scaling.  To  evaluate  the  schedulability  of  a  task  system  using  cache  management,  we 
scale  the  level-B  utilization  of  tasks  scheduled  without  cache  management  by  a  scaling  factor 
commensurate  with  those  observed  in  Section  4.1.  This  utilization  scaling  theoretically  allows  a 
scheduler  using  our  cache  management  techniques  to  schedule  task  systems  that,  unmanaged, 
contain  tasks  with  level-B  utilizations  greater  than  one  or  have  a  total  level-B  utilization  greater 
than  the  number  of  processors.  For  example,  using  our  cache  management  techniques,  it  may  be 
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possible  to  schedule  on  four  processors  a  task  system  with  a  total  unmanaged  level-B  utilization  of 
five  and  that  contains  a  task  with  an  unmanaged  utilization  of  1.5.  These  observations  motivate  the 
experimental  design  of  our  schedulability  study,  as  is  discussed  next. 

We  evaluated  the  schedulability  of  randomly  generated  task  systems  having  equal  level-B  and  -C 
subsystem  utilizations.  That  is,  for  each  system  utilization  x,  we  generated  task  systems  where  the 
sum  of  all  level-B  tasks’  level-B  utilizations  was  equal  to  x,  and  the  sum  of  all  level-B  and  -C 
tasks’  level-C  utilizations  together  also  summed  to  x  (recall  that  in  mixed-criticality  scheduling,  a 
task  has  an  execution  cost,  and  hence  utilization,  for  each  criticality  level).  The  level-C  utilization 
of  each  level-B  task  was  assumed  to  be  10%  of  its  level-B  utilization.  We  considered  task  systems 
with  utilizations  without  cache  management  of  {1.0,  1.1,  .  .  .  ,  10.0}  with  respect  to  the 
previously  described  system  (four  processors,  1MB  8-way  set  associative  cache,  with  512  KB 
allocated  to  each  criticality  level).  Both  the  level-B  utilizations  of  level-B  tasks  and  the  level-C 
utilizations  of  level-C  tasks  were  uniformly  distributed  over  [0.1, 0.4]  or  [0.5, 0.9],  though  in  all 
graphs  presented  herein,  which  depict  relevant  trends,  the  former  utilization  distribution  is 
assumed.  We  note  that  greater  per-task  utilizations  could  be  supported  using  cache  management 
( e.g .,  1.5).  However,  we  did  not  consider  such  systems  in  our  evaluations  as  there  is  no  basis  for 
comparison  with  a  system  with  an  unmanaged  cache  (which  cannot  schedule  tasks  with  utilizations 
greater  than  1.0). 

Schedulability  was  determined  by  evaluating  the  generated  task  systems  using  the  level-B  and  -C 
schedulability  conditions  given  in  [46],  with  overheads  factored  in  using  the  techniques  described 
in  [10,  Chapter  3].  Worst-case  (average-case)  overheads  were  assumed  for  level  B  (C),  as  level  B 
(C)  is  HRT  (SRT).  Comparisons  between  systems  with  and  without  cache  management  were 
performed  by  generating  task  systems  for  an  unmanaged  system,  and  then  scaling  task  execution 
times  by  scaling  factors  commensurate  with  those  observed  in  Section  4. 1 . 

Coloring  schemes.  We  investigated  two  heuristics  for  assigning  colors  to  level-B  tasks.  Each  was 
designed  to  reduce  cross-core  color  contention,  since  such  contention  is  detrimental  to  both  cache 
locking  and  cache  scheduling.  Both  heuristics  function  similarly  and  are  applied  assuming  that 
tasks  have  been  previously  assigned  to  physical  processors  and  that  physical  memory  pages  are 
unconstrained  (conceptually,  there  are  infinitely  many  physical  memory  pages  available  for  each 
color — note  that  our  test  platform  has  2 13  pages  per  color).  Letting  S  denote  the  size  of  the  cache 
and  W  denote  the  maximum  WSS,  we  divide  the  cache  into  Ls/JTJ  “bins”,  each  of  size  W .  Under 
way-first  ( color-first )  allocation,  the  number  of  ways  (colors)  in  each  bin  is  maximized.  For 
example,  a  bin  half  the  size  of  an  8-way,  32-color  cache  has  8  ways  of  16  colors  under  way- first 
allocation,  and  4  ways  of  16  colors  under  color- first  allocation.  Under  either  heuristic,  processors 
are  packed  into  the  obtained  bins  using  the  worst-fit  heuristic,  and  each  task  is  assigned  colors  to 
satisfy  its  WSS  from  the  set  of  colors  assigned  to  the  processor  on  which  it  is  partitioned.  Note  that 
if  W  <  S/m,  then  our  heuristics  partition  the  cache. 

These  heuristic  are  more  flexible  than  cache  partitioning,  as  they  allow  colors  to  be  shared  across 
processors.  However,  the  problem  of  assigning  colors  to  tasks  is  generally  intractable  and  may  be 
particularly  hard  if  the  number  of  physical  memory  pages  per  color  is  severely  constrained.  In 
many  embedded  systems,  such  constraints  may  exist  due  to  SWaP  requirements,  the  need  to 
support  separate  processing  modes,  etc.  Nonetheless,  the  experiments  below  demonstrate  that 
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if  a  reasonable  color  assignment  can  be  found,  then  schedulability-related  impacts  can  be 
dramatic.  We  plan  to  further  investigate  color  assignment  strategies  in  the  future  to  widen  the 
applicability  of  our  work. 

Schedulability.  We  now  discuss  several  schedulability  graphs  involving  randomly  generated  task 
systems.  A  cache  management  scheme’s  schedulability  is  specified  with  respect  to  a  set  S  of 
generated  task  systems  and  is  defined  as  the  fraction  of  S  deemed  schedulable  under  that  scheme. 
We  present  level-B  (HRT)  and  level-C  (SRT)  schedulability  graphs  in  Figure  8.  In  the  level-B 
graphs,  schedulability  is  plotted  versus  total  level-B  utilization  prior  to  scaling.  In  the  level-C 
graphs,  it  is  plotted  versus  total  level-B  and  -C  utilization  assuming  level-C  execution  costs  for 
both.  With  respect  to  the  level-B  scheduling  policy,  we  denote  ordinary  partitioned  RM  (P-RM) 
scheduling  with  an  unmanaged  cache  (i.e.,  no  cache  locking  or  scheduling)  as  UM,  cache  locking 
as  CL,  cache  scheduling  as  CS,  and  cache  locking  with  period  splitting  as  CL-PS.  We  do  not 
present  graphs  for  job  splitting  as  it  was  always  inferior  to  period  splitting  alone.  To  prevent 
interference  between  tasks  of  different  criticality  levels,  level-B  and  -C  tasks  are  each  assigned  half 
of  the  available  cache,  so  a  WSS  of  256K  corresponds  to  half  of  the  available  level-B  cache.  Our 
major  observations  are  as  follows. 

Observation  1.  In  all  observed  cases,  at  least  one  of  the  proposed  cache  management  solutions 
(CL,  CS,  or  CL-PS)  offered  improved  level-B  schedulability,  often  scheduling  task  systems  with 
almost  50%  greater  processor  utilization  than  an  unmanaged  P-RM  level-B  subsystem  (UM). 

This  observation  is  corroborated  by  insets  (a)-(e)  of  Figure  8,  which  depict  level-B  schedulability. 
For  example,  in  Figure  8(a),  with  a  scaling  factor  of  three,  which  is  less  than  we  observed  in 
practice,  cache  scheduling  (CS)  could  schedule  task  systems  with  a  level-B  utilization  of  50% 
more  than  that  of  an  unmanaged  P-RM  system  (UM).  By  scaling  level-B  tasks’  level-B  execution 
times  down  (to  account  for  fewer  cache  misses  due  to  cache  management),  we  are  able  to  schedule 
task  systems  that  previously  would  have  been  unschedulable  because  they  would  have  overutilized 
the  available  processing  cores.  This  effect  is  even  more  pronounced  with  larger  scaling  factors. 
These  results  show  that  the  benefits  of  cache  management  techniques  may  outweigh  the  increased 
overheads  they  entail. 

Observation  2.  For  some  system  configurations,  unmanaged  P-RM  (UM)  offered  improved  level- 
B  schedulability  over  one  or  more  cache  management  techniques.  This  suggests  that  the  cache  can 
be  managed  improperly. 

In  the  level-B  schedulability  results  presented  in  insets  (a)-(d)  of  Figure  8,  unmanaged  P-RM  (UM) 
outperforms  cache  locking  (CL)  in  many  cases  and  in  Figure  8(d),  unmanaged  P-RM  (UM) 
outperforms  cache  scheduling  (CS).  However,  as  described  in  Observation  1,  at  least  one  of  our 
cache  management  solutions  outperforms  unmanaged  P-RM  (UM).  Thus,  a  system  designer  must 
evaluate  their  task  system  to  decide  which  of  our  cache  management  solutions  is  best. 

Observation  3.  In  all  observed  cases,  period  splitting  improved  level-B  schedulability  under  cache 
locking. 

This  can  be  seen  in  insets  (a)-(e)  of  Figure  8.  Henceforth,  when  we  refer  to  cache  locking,  we 
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Figure  8:  Insets  (a)-(d)  compare  the  proposed  cache  management  approaches  in  terms  of  level-B  schedulability 
for  different  scaling  factors,  WSSs,  and  coloring  algorithms.  Inset  (e)  compares  the  schedulability  of 
the  proposed  cache  management  techniques  using  the  coloring  algorithm  that  provides  the  best 
schedulability  for  each  technique,  respectively.  Inset  (f)  depicts  level-C  schedulability. 
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assume  period  splitting  (CL-PS)  is  used. 


Observation  4.  Under  color-first  allocation,  cache  locking  (CL-PS)  offered  the  best  level-B 
schedulability,  while  under  way-first  allocation,  cache  scheduling  offered  the  best  level-B 
schedulability.  This  suggests  that  color  assignment  has  significant  impact  on  level-B  schedulability. 

Comparing  level-B  schedulability  using  way- first  and  color-first  allocation  in  insets  (b)  and  (d)  of 
Figure  8,  respectively,  we  observe  that  cache  locking  (CL-PS)  performs  much  better  under  color- 
first  allocation.  This  is  because  the  blocking  bound  for  cache  locking  is  0(mr/k )  where  r  is  the 
maximum  number  of  ways  of  a  color  required,  which  is  minimized  under  color- first  allocation, 
and  k  is  the  total  number  of  ways.  In  contrast,  cache  scheduling  has  much  better  level-B 
schedulability  under  way-first  allocation,  but  performs  comparatively  worse  under  color-first 
allocation.  This  is  because  of  the  pessimism  of  our  analysis,  which  assumes  that  only  one  task 
executes  with  a  color  at  a  time,  regardless  of  the  number  of  ways  required  (recall  from  Section  3.4 
that  tighter  analysis  for  cache  scheduling  is  an  open  problem).  This  pessimism  is  particularly 
detrimental  to  level-B  schedulability  using  color-first  allocation.  However,  under  way-first 
allocation,  tasks  use  all  of  the  ways  of  comparatively  fewer  colors,  which  in  turn  reduces  the 
number  of  cache  processors,  and  improves  schedulability.  These  observations  demonstrate  the 
impact  that  the  coloring  algorithms  have  on  level-B  schedulability. 

Observation  5.  The  overheads  associated  with  our  cache-management  techniques  only  minimally 
impacted  level  C. 

Our  techniques  have  a  profound  impact  at  level  B.  However,  as  can  be  seen  in  Figure  8(f), 
schedulability  at  level  C  is  decreased  by  a  comparatively  small  amount.  While  this  offset  means 
we  can  schedule  no  additional  level-C  work,  we  could  execute  many  more  high-criticality  level-B 
tasks  in  lieu  of  lower-criticality  level-C  tasks. 

Observation  6.  In  all  observed  cases,  cache  scheduling  under  way- first  allocation  provided  better 
level-B  schedulability  than  any  other  cache  management  technique  under  either  way-first  or  color- 
first  allocation. 

This  can  be  seen  in  insets  (a)-(e)  of  Figure  8.  Under  color-first  allocation,  as  in  Figure  8(d),  level-B 
schedulability  under  cache  scheduling  (CS)  was  less  than  that  of  cache  locking  (CL-  PS). 
However,  in  all  cases  in  which  cache  locking  (CL-PS)  outperforms  cache  scheduling  using  color- 
first  allocation,  level-B  schedulability  using  cache  scheduling  under  way-first  allocation  was 
greater.  This  can  be  seen  in  Figure  8(e),  which  shows  for  each  technique  the  best  level-B 
schedulability  seen  under  either  coloring  scheme. 

Cache  scheduling  also  has  practical  benefits  over  cache  locking.  Cache  scheduling  is  easier  to 
implement,  being  little  more  than  a  multiprocessor  RM  algorithm  with  an  extra  “mapping”  step, 
while  cache  locking  uses  both  partitioned  RM  and  a  locking  protocol.  Additionally,  cache 
scheduling  is  more  flexible  than  cache  locking  (which  uses  period  splitting).  For  example,  it  can  be 
more  easily  applied  to  schedule  non-periodic  task  systems.  Thus,  unless  a  system  is  unable  to  use 
way-first  page  allocation,  cache  scheduling  is  the  recommended  cache-management  technique. 
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5  CONCLUSIONS 


In  this  project,  we  devised  several  techniques  for  managing  shared  caches  on  multicore  systems 
within  the  MC  mixed-criticality  scheduling  framework.  We  also  experimentally  evaluated  an 
implementation  of  these  techniques  on  a  quad-core  ARM  machine.  These  experiments  indicate 
that  proper  shared  cache  management  can  lessen  WCETs  and  positively  impact  schedulability 
despite  increased  system  overheads.  In  fact,  in  our  experiments,  task  systems  could  be  scheduled 
successfully  with  cache  management  that  had  total  utilizations  exceeding  the  capacity  of  our  test 
platform  by  50%  or  greater  without  cache  management.  The  development  of  these  cache 
management  techniques  as  well  as  MC2  was  motivated  by  ongoing  work  with  colleagues  at  NGC 
on  defining  useful  real-time  resource  allocation  infrastructure  for  future  UAVs.  These  UAVs 
represent  an  interesting  challenge  problem  in  work  on  cyber-physical  systems.  Much  of  the  work 
included  in  this  document  appears  in  [58],  which  will  receive  an  Outstanding  Paper  Award  at 
ECRTS  2013. 

Future  work.  This  work  opens  many  avenues  for  future  research,  which  we  plan  to  pursue, 
ranging  from  systems-level  issues  to  theoretical  scheduling  problems.  For  example,  we  plan  to 
explore  cache  management  techniques  for  shared  libraries,  dynamic  memory,  and  jobs  with  WSSs 
larger  than  the  cache.  We  also  plan  to  evaluate,  from  the  perspective  of  rigorous  timing  analysis, 
the  improvements  to  calculated  WCETs  made  possible  by  these  cache  management  techniques. 
Additionally,  we  are  interested  in  developing  scheduling  algorithms,  schedulability  tests,  and 
resource  allocation  techniques  that  can  support  both  greater  system  utilizations  (with  respect  to 
both  the  processors  as  well  as  the  cache)  and  more  than  two  criticality  levels. 

Other  research  results.  In  addition  to  shared  cache  management,  we  have  investigated  a  variety 
of  other  topics  that  are  related  to  the  research  covered  above.  Some  of  this  work  was  co-funded  by 
other  agencies  (namely  NSF,  ARO,  and  AFOSR).  Such  topics  include  the  following. 

•  GPU-enabled  multicore  real-time  systems:  Graphics  processing  units  (GPUs)  can  be  used  in 
real-time  systems  to  significantly  speed  up  computations  that  are  parallelizable.  However, 
predictably  managing  GPUs  so  that  real-time  constraints  can  be  ensured  is  challenging.  We  have 
shown  that  the  needed  predictability  can  be  attained  by  using  multiprocessor  real-time  scheduling 
and  synchronization  techniques  to  control  GPU  accesses  [20,  22,  23,  25], 

•  Optimal  real-time  multiprocessor  locking  protocols:  Real-time  applications  require  locking 
protocols  for  controlling  accesses  to  shared  resources  that  are  amenable  to  real-time  analysis.  In 
recent  years,  the  advent  of  multicore  technologies  has  led  to  a  renewed  interest  in  real-time  locking 
protocols  for  multiprocessors.  In  a  series  of  papers,  we  have  presented  a  variety  of  such  protocols, 
for  both  mutual  exclusion  (mutex)  and  reader/writer  synchronization  [11,  12,  13,  21,  56,  57].  (Two 
of  these  papers  won  best  paper  awards  [11,  12],  at  RTSS  2010  and  EMSOFT  2011,  respectively.) 
Our  protocols  are  the  first  to  be  asympotitically  optimal  despite  over  two  decades  of  research  on 
this  topic.  As  seen  herein,  we  leveraged  this  work  in  designing  our  synchronization-oriented  cache 
management  framework. 

•  SRT  systems  with  self-suspensions  and  precedence  constraints:  In  practice,  real-time  tasks 
may  interact  with  each  other  or  external  devices.  Such  interactions  can  result  in  task 

APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 
22 


suspensions  and  precedence  constraints,  both  of  which  may  complicate  real-time  schedulability 
analysis.  We  have  produced  several  analysis  techniques  for  dealing  with  such  complications 
when  computing  tardiness  bounds  for  SRT  tasks  [37,  38,  39,  40,  41]. 

•  Middleware  and  RTOS  issues:  In  this  project,  we  have  assumed  that  mixed-criticality 
resource  allocation  techniques  should  be  implemented  within  the  RTOS.  However,  through 
interactions  with  employees  of  RTOS  companies  embedded  at  NGC,  we  learned  that  a 
complete  RTOS  implementation  may  be  unlikely  due  to  market  forces  and  other  reasons.  Thus, 
an  appropriate  factoring  of  functionality  between  the  RTOS  and  middleware  is  needed.  We 
have  published  two  papers  in  which  such  issues  are  considered  [44,  45],  We  have  also 
published  several  papers  in  which  implementation-oriented  tradeoffs  involving  real-time 
schedulers  (including  MC2)  have  been  investigated  [8,  9,  24,  29,  31]. 
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CS 
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ECRTS 
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J-UCAS 

G-EDF 

HRT 

LbM 

LITMUSrt 

MC2 

NGC 

N-UCAS 

OS 

P-EDF 

P-RM 

PI-Blocking 

PREM 

PS 

RM 

RTOS 

S-Aware 

SRT 

SWaP 

TDMA 

UAV 

WCET 

WS 

WWS 


Complex  Adaptive  Systems 
Cache  Locking 
Cache  Scheduling 
Earliest  Deadline  First 

Euromicro  Technical  Committee  on  Real-Time  Systems 
First  In  First  Out 

Joint  Unmanned  Combat  Air  System 
Global  EDF 
Hard  Real-Time 
Lockdown  by  Master 

Linux  Testbed  for  Multiprocessor  Scheduling  in  Real-Time  systems 

Mixed-Criticality  on  Multi-Core 

Northrop  Grumman  Corp 

Navy’s  Unmanned  Combat  Air  System 

Operating  System 

Partitioned  EDF 

Partitioned  Rate  Monotonic 

Priority  Inversion  Blocking 

PRedictable  Execution  Model 

Period  Splitting 

Real-Time  Nested  Locking  Protocol 

Real-Time  Operating  System 

Suspension-Aware 

Soft  Real-Time 

Size,  Weight,  and  Power 

Time  Division  Multiple  Access 

Unmanned  Air  vehicle 

Worst-Case  Execution  Time 

Working  Set 

Working  Set  Size 
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